This is an extensive Exploratory Data Analysis for the Porto Seguro’s Safe Driver Prediction competition within the R environment of the tidyverse and ggplot2. We will visualise all the different data features, their relation to the target variable, explore multi-parameter interactions, and perform feature engineering.
The aim of this challenge is to predict the probability whether a driver will make an insurance claim, with the purpose of providing a fairer insurance cost on the basis of individual driving habits. It is sponsored by Porto Seguro - a major car and home insurance company in Brazil
The data comes in the traditional Kaggle form of one training and test file each: ../input/train.csv & ../input/test.csv. Each row corresponds to a specific policy holder and the columns describe their features. The target variable is conveniently named target here and it indicates whether this policy holder made an insurance claim in the past.
We load a range of libraries for general data wrangling and general visualisation together with more specialised tools.
# general visualisation
library('ggplot2') # visualisation
library('scales') # visualisation
library('grid') # visualisation
library('ggthemes') # visualisation
library('gridExtra') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation
# general data manipulation
library('dplyr') # data manipulation
library('readr') # input/output
library('data.table') # data manipulation
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
library('rlang') # data manipulation
# specific visualisation
library('alluvial') # visualisation
library('ggfortify') # visualisation
## Warning: namespace 'DBI' is not available and has been replaced
## by .GlobalEnv when processing object 'call.'
## Warning: namespace 'DBI' is not available and has been replaced
## by .GlobalEnv when processing object 'call.'
library('ggrepel') # visualisation
library('ggridges') # visualisation
library('VIM') # NAs
library('plotly') # interactive
library('ggforce') # visualisation
# modelling
library('xgboost') # modelling
library('caret') # modelling
library('MLmetrics') # gini metric
We use the multiplot function, courtesy of R Cookbooks to create multi-panel plots. We also make use of a brief helper function to compute binomial confidence intervals.
# Define multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
# function to extract binomial confidence levels
get_binCI <- function(x,n) as.list(setNames(binom.test(x,n)$conf.int, c("lwr", "upr")))
We use data.table’s fread function to speed up reading in the data, even though in this challenge our files are not very large with about 110 MB for train and 165 MB for test. Here we are taking into account the fact that missing values in the original data sets are indicated by -1 or -1.0 and turn those into “proper” NAs.
train <- as.tibble(fread('input/train.csv', na.strings=c("-1","-1.0")))
test <- as.tibble(fread('input/test.csv', na.strings=c("-1","-1.0")))
sample_submit <- as.tibble(fread('input/sample_submission.csv'))
As a first step let’s have an overview of the data sets using the summary and glimpse tools.
summary(train)
## id target ps_ind_01 ps_ind_02_cat
## Min. : 7 Min. :0.00000 Min. :0.0 Min. :1.00
## 1st Qu.: 371992 1st Qu.:0.00000 1st Qu.:0.0 1st Qu.:1.00
## Median : 743548 Median :0.00000 Median :1.0 Median :1.00
## Mean : 743804 Mean :0.03645 Mean :1.9 Mean :1.36
## 3rd Qu.:1115549 3rd Qu.:0.00000 3rd Qu.:3.0 3rd Qu.:2.00
## Max. :1488027 Max. :1.00000 Max. :7.0 Max. :4.00
## NA's :216
## ps_ind_03 ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin
## Min. : 0.000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000
## Median : 4.000 Median :0.000 Median :0.000 Median :0.0000
## Mean : 4.423 Mean :0.417 Mean :0.419 Mean :0.3937
## 3rd Qu.: 6.000 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:1.0000
## Max. :11.000 Max. :1.000 Max. :6.000 Max. :1.0000
## NA's :83 NA's :5809
## ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.000000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000000
## Median :0.000 Median :0.0000 Median :0.0000 Median :0.000000
## Mean :0.257 Mean :0.1639 Mean :0.1853 Mean :0.000373
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.000000
##
## ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin
## Min. :0.000000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.000000 Median :0.000000 Median :0.0000000
## Mean :0.001692 Mean :0.009439 Mean :0.0009476
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.000000 Max. :1.000000 Max. :1.0000000
##
## ps_ind_14 ps_ind_15 ps_ind_16_bin ps_ind_17_bin
## Min. :0.00000 Min. : 0.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.: 5.0 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median : 7.0 Median :1.0000 Median :0.0000
## Mean :0.01245 Mean : 7.3 Mean :0.6608 Mean :0.1211
## 3rd Qu.:0.00000 3rd Qu.:10.0 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :4.00000 Max. :13.0 Max. :1.0000 Max. :1.0000
##
## ps_ind_18_bin ps_reg_01 ps_reg_02 ps_reg_03
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.06
## 1st Qu.:0.0000 1st Qu.:0.400 1st Qu.:0.2000 1st Qu.:0.63
## Median :0.0000 Median :0.700 Median :0.3000 Median :0.80
## Mean :0.1534 Mean :0.611 Mean :0.4392 Mean :0.89
## 3rd Qu.:0.0000 3rd Qu.:0.900 3rd Qu.:0.6000 3rd Qu.:1.08
## Max. :1.0000 Max. :0.900 Max. :1.8000 Max. :4.04
## NA's :107772
## ps_car_01_cat ps_car_02_cat ps_car_03_cat ps_car_04_cat
## Min. : 0.000 Min. :0.0000 Min. :0.0 Min. :0.0000
## 1st Qu.: 7.000 1st Qu.:1.0000 1st Qu.:0.0 1st Qu.:0.0000
## Median : 7.000 Median :1.0000 Median :1.0 Median :0.0000
## Mean : 8.298 Mean :0.8299 Mean :0.6 Mean :0.7252
## 3rd Qu.:11.000 3rd Qu.:1.0000 3rd Qu.:1.0 3rd Qu.:0.0000
## Max. :11.000 Max. :1.0000 Max. :1.0 Max. :9.0000
## NA's :107 NA's :5 NA's :411231
## ps_car_05_cat ps_car_06_cat ps_car_07_cat ps_car_08_cat
## Min. :0.00 Min. : 0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.00 1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:1.0000
## Median :1.00 Median : 7.000 Median :1.000 Median :1.0000
## Mean :0.53 Mean : 6.555 Mean :0.948 Mean :0.8321
## 3rd Qu.:1.00 3rd Qu.:11.000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :1.00 Max. :17.000 Max. :1.000 Max. :1.0000
## NA's :266551 NA's :11489
## ps_car_09_cat ps_car_10_cat ps_car_11_cat ps_car_11
## Min. :0.000 Min. :0.0000 Min. : 1.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.0000 1st Qu.: 32.00 1st Qu.:2.000
## Median :2.000 Median :1.0000 Median : 65.00 Median :3.000
## Mean :1.331 Mean :0.9921 Mean : 62.22 Mean :2.346
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 93.00 3rd Qu.:3.000
## Max. :4.000 Max. :2.0000 Max. :104.00 Max. :3.000
## NA's :569 NA's :5
## ps_car_12 ps_car_13 ps_car_14 ps_car_15
## Min. :0.1000 Min. :0.2506 Min. :0.11 Min. :0.000
## 1st Qu.:0.3162 1st Qu.:0.6709 1st Qu.:0.35 1st Qu.:2.828
## Median :0.3742 Median :0.7658 Median :0.37 Median :3.317
## Mean :0.3799 Mean :0.8133 Mean :0.37 Mean :3.066
## 3rd Qu.:0.4000 3rd Qu.:0.9062 3rd Qu.:0.40 3rd Qu.:3.606
## Max. :1.2649 Max. :3.7206 Max. :0.64 Max. :3.742
## NA's :1 NA's :42620
## ps_calc_01 ps_calc_02 ps_calc_03 ps_calc_04
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.2000 1st Qu.:0.2000 1st Qu.:0.2000 1st Qu.:2.000
## Median :0.5000 Median :0.4000 Median :0.5000 Median :2.000
## Mean :0.4498 Mean :0.4496 Mean :0.4498 Mean :2.372
## 3rd Qu.:0.7000 3rd Qu.:0.7000 3rd Qu.:0.7000 3rd Qu.:3.000
## Max. :0.9000 Max. :0.9000 Max. :0.9000 Max. :5.000
##
## ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08
## Min. :0.000 Min. : 0.000 Min. :0.000 Min. : 2.000
## 1st Qu.:1.000 1st Qu.: 7.000 1st Qu.:2.000 1st Qu.: 8.000
## Median :2.000 Median : 8.000 Median :3.000 Median : 9.000
## Mean :1.886 Mean : 7.689 Mean :3.006 Mean : 9.226
## 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.:4.000 3rd Qu.:10.000
## Max. :6.000 Max. :10.000 Max. :9.000 Max. :12.000
##
## ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12
## Min. :0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.: 6.000 1st Qu.: 4.000 1st Qu.: 1.000
## Median :2.000 Median : 8.000 Median : 5.000 Median : 1.000
## Mean :2.339 Mean : 8.434 Mean : 5.441 Mean : 1.442
## 3rd Qu.:3.000 3rd Qu.:10.000 3rd Qu.: 7.000 3rd Qu.: 2.000
## Max. :7.000 Max. :25.000 Max. :19.000 Max. :10.000
##
## ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin
## Min. : 0.000 Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.: 6.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median : 7.000 Median :0.0000 Median :1.0000
## Mean : 2.872 Mean : 7.539 Mean :0.1224 Mean :0.6278
## 3rd Qu.: 4.000 3rd Qu.: 9.000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :13.000 Max. :23.000 Max. :1.0000 Max. :1.0000
##
## ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.5542 Mean :0.2872 Mean :0.349 Mean :0.1533
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000
##
glimpse(train)
## Observations: 595,212
## Variables: 59
## $ id <int> 7, 9, 13, 16, 17, 19, 20, 22, 26, 28, 34, 35, 3...
## $ target <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_01 <int> 2, 1, 5, 0, 0, 5, 2, 5, 5, 1, 5, 2, 2, 1, 5, 5,...
## $ ps_ind_02_cat <int> 2, 1, 4, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,...
## $ ps_ind_03 <int> 5, 7, 9, 2, 0, 4, 3, 4, 3, 2, 2, 3, 1, 3, 11, 3...
## $ ps_ind_04_cat <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1,...
## $ ps_ind_05_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_06_bin <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_07_bin <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,...
## $ ps_ind_08_bin <int> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...
## $ ps_ind_09_bin <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ ps_ind_10_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_11_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_12_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_13_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_14 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_15 <int> 11, 3, 12, 8, 9, 6, 8, 13, 6, 4, 3, 9, 10, 12, ...
## $ ps_ind_16_bin <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,...
## $ ps_ind_17_bin <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_18_bin <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ ps_reg_01 <dbl> 0.7, 0.8, 0.0, 0.9, 0.7, 0.9, 0.6, 0.7, 0.9, 0....
## $ ps_reg_02 <dbl> 0.2, 0.4, 0.0, 0.2, 0.6, 1.8, 0.1, 0.4, 0.7, 1....
## $ ps_reg_03 <dbl> 0.7180703, 0.7660777, NA, 0.5809475, 0.8407586,...
## $ ps_car_01_cat <int> 10, 11, 7, 7, 11, 10, 6, 11, 10, 11, 11, 11, 6,...
## $ ps_car_02_cat <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,...
## $ ps_car_03_cat <int> NA, NA, NA, 0, NA, NA, NA, 0, NA, 0, NA, NA, NA...
## $ ps_car_04_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 8, 0, 0, 0, 0, 9,...
## $ ps_car_05_cat <int> 1, NA, NA, 1, NA, 0, 1, 0, 1, 0, NA, NA, NA, 1,...
## $ ps_car_06_cat <int> 4, 11, 14, 11, 14, 14, 11, 11, 14, 14, 13, 11, ...
## $ ps_car_07_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_08_cat <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,...
## $ ps_car_09_cat <int> 0, 2, 2, 3, 2, 0, 0, 2, 0, 2, 2, 0, 2, 2, 2, 0,...
## $ ps_car_10_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_11_cat <int> 12, 19, 60, 104, 82, 104, 99, 30, 68, 104, 20, ...
## $ ps_car_11 <int> 2, 3, 1, 1, 3, 2, 2, 3, 3, 2, 3, 3, 3, 3, 1, 2,...
## $ ps_car_12 <dbl> 0.4000000, 0.3162278, 0.3162278, 0.3741657, 0.3...
## $ ps_car_13 <dbl> 0.8836789, 0.6188165, 0.6415857, 0.5429488, 0.5...
## $ ps_car_14 <dbl> 0.3708099, 0.3887158, 0.3472751, 0.2949576, 0.3...
## $ ps_car_15 <dbl> 3.605551, 2.449490, 3.316625, 2.000000, 2.00000...
## $ ps_calc_01 <dbl> 0.6, 0.3, 0.5, 0.6, 0.4, 0.7, 0.2, 0.1, 0.9, 0....
## $ ps_calc_02 <dbl> 0.5, 0.1, 0.7, 0.9, 0.6, 0.8, 0.6, 0.5, 0.8, 0....
## $ ps_calc_03 <dbl> 0.2, 0.3, 0.1, 0.1, 0.0, 0.4, 0.5, 0.1, 0.6, 0....
## $ ps_calc_04 <int> 3, 2, 2, 2, 2, 3, 2, 1, 3, 2, 2, 2, 4, 2, 3, 2,...
## $ ps_calc_05 <int> 1, 1, 2, 4, 2, 1, 2, 2, 1, 2, 3, 2, 1, 1, 1, 1,...
## $ ps_calc_06 <int> 10, 9, 9, 7, 6, 8, 8, 7, 7, 8, 8, 8, 8, 10, 8, ...
## $ ps_calc_07 <int> 1, 5, 1, 1, 3, 2, 1, 1, 3, 2, 2, 2, 4, 1, 2, 5,...
## $ ps_calc_08 <int> 10, 8, 8, 8, 10, 11, 8, 6, 9, 9, 9, 10, 11, 8, ...
## $ ps_calc_09 <int> 1, 1, 2, 4, 2, 3, 3, 1, 4, 1, 4, 1, 1, 3, 3, 2,...
## $ ps_calc_10 <int> 5, 7, 7, 2, 12, 8, 10, 13, 11, 11, 7, 8, 9, 8, ...
## $ ps_calc_11 <int> 9, 3, 4, 2, 3, 4, 3, 7, 4, 3, 6, 9, 6, 2, 4, 5,...
## $ ps_calc_12 <int> 1, 1, 2, 2, 1, 2, 0, 1, 2, 5, 3, 2, 3, 0, 1, 2,...
## $ ps_calc_13 <int> 5, 1, 7, 4, 1, 0, 0, 3, 1, 0, 3, 1, 3, 4, 3, 6,...
## $ ps_calc_14 <int> 8, 9, 7, 9, 3, 9, 10, 6, 5, 6, 6, 10, 8, 3, 9, ...
## $ ps_calc_15_bin <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_calc_16_bin <int> 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1,...
## $ ps_calc_17_bin <int> 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1,...
## $ ps_calc_18_bin <int> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ ps_calc_19_bin <int> 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,...
## $ ps_calc_20_bin <int> 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,...
We find:
There are lots of features here. In total, our training data has 59 variables, including id and target. In some of them we already see a number of NAs.
The data description mentions that the names of the features indicate whether they are binary (bin) or categorical (cat) variables. Everything else is continuous or ordinal.
We have already been told by Adriano Moala that the names of the variables indicate certain properties: *Ind" is related to individual or driver, “reg” is related to region, “car” is related to car itself and “calc” is an calculated feature.’ Here we will refer to these properties as groups.
Note, that there is a ps_car_11 as well as a ps_car_11_cat. This is the only occasion where the numbering per group is neither consecutive nor unique. Probably a typo in the script that created the variable names.
The features are anonymised, because that worked so well in Mercedes. Just kidding. Mostly ;-)
summary(test)
## id ps_ind_01 ps_ind_02_cat ps_ind_03
## Min. : 0 Min. :0.000 Min. :1.000 Min. : 0.000
## 1st Qu.: 372022 1st Qu.:0.000 1st Qu.:1.000 1st Qu.: 2.000
## Median : 744307 Median :1.000 Median :1.000 Median : 4.000
## Mean : 744154 Mean :1.902 Mean :1.359 Mean : 4.414
## 3rd Qu.:1116309 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 6.000
## Max. :1488026 Max. :7.000 Max. :4.000 Max. :11.000
## NA's :307
## ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :0.0000
## Mean :0.4176 Mean :0.422 Mean :0.3932 Mean :0.2572
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.000 Max. :1.0000 Max. :1.0000
## NA's :145 NA's :8710
## ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin
## Min. :0.0000 Min. :0.0000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.0000 Median :0.0000 Median :0.000000 Median :0.000000
## Mean :0.1637 Mean :0.1859 Mean :0.000373 Mean :0.001595
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.0000 Max. :1.000000 Max. :1.000000
##
## ps_ind_12_bin ps_ind_13_bin ps_ind_14 ps_ind_15
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. : 0.000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.: 5.000
## Median :0.000000 Median :0.000000 Median :0.00000 Median : 7.000
## Mean :0.009376 Mean :0.001039 Mean :0.01238 Mean : 7.297
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:10.000
## Max. :1.000000 Max. :1.000000 Max. :4.00000 Max. :13.000
##
## ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_reg_01
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.4000
## Median :1.0000 Median :0.0000 Median :0.000 Median :0.7000
## Mean :0.6606 Mean :0.1204 Mean :0.155 Mean :0.6111
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.9000
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :0.9000
##
## ps_reg_02 ps_reg_03 ps_car_01_cat ps_car_02_cat
## Min. :0.0000 Min. :0.06 Min. : 0.000 Min. :0.00
## 1st Qu.:0.2000 1st Qu.:0.63 1st Qu.: 7.000 1st Qu.:1.00
## Median :0.3000 Median :0.80 Median : 7.000 Median :1.00
## Mean :0.4399 Mean :0.89 Mean : 8.294 Mean :0.83
## 3rd Qu.:0.6000 3rd Qu.:1.09 3rd Qu.:11.000 3rd Qu.:1.00
## Max. :1.8000 Max. :4.42 Max. :11.000 Max. :1.00
## NA's :161684 NA's :160 NA's :5
## ps_car_03_cat ps_car_04_cat ps_car_05_cat ps_car_06_cat
## Min. :0.0 Min. :0.0000 Min. :0.0 Min. : 0.000
## 1st Qu.:0.0 1st Qu.:0.0000 1st Qu.:0.0 1st Qu.: 1.000
## Median :1.0 Median :0.0000 Median :1.0 Median : 7.000
## Mean :0.6 Mean :0.7258 Mean :0.5 Mean : 6.564
## 3rd Qu.:1.0 3rd Qu.:0.0000 3rd Qu.:1.0 3rd Qu.:11.000
## Max. :1.0 Max. :9.0000 Max. :1.0 Max. :17.000
## NA's :616911 NA's :400359
## ps_car_07_cat ps_car_08_cat ps_car_09_cat ps_car_10_cat
## Min. :0.000 Min. :0.0000 Min. :0.00 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:0.00 1st Qu.:1.0000
## Median :1.000 Median :1.0000 Median :2.00 Median :1.0000
## Mean :0.948 Mean :0.8323 Mean :1.33 Mean :0.9921
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:2.00 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :4.00 Max. :2.0000
## NA's :17331 NA's :877
## ps_car_11_cat ps_car_11 ps_car_12 ps_car_13
## Min. : 1.00 Min. :0.000 Min. :0.1414 Min. :0.2758
## 1st Qu.: 32.00 1st Qu.:2.000 1st Qu.:0.3162 1st Qu.:0.6712
## Median : 65.00 Median :3.000 Median :0.3742 Median :0.7661
## Mean : 62.28 Mean :2.347 Mean :0.3800 Mean :0.8136
## 3rd Qu.: 94.00 3rd Qu.:3.000 3rd Qu.:0.4000 3rd Qu.:0.9061
## Max. :104.00 Max. :3.000 Max. :1.2649 Max. :4.0313
## NA's :1
## ps_car_14 ps_car_15 ps_calc_01 ps_calc_02
## Min. :0.11 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.35 1st Qu.:2.828 1st Qu.:0.2000 1st Qu.:0.2000
## Median :0.37 Median :3.317 Median :0.4000 Median :0.5000
## Mean :0.37 Mean :3.068 Mean :0.4496 Mean :0.4505
## 3rd Qu.:0.40 3rd Qu.:3.606 3rd Qu.:0.7000 3rd Qu.:0.7000
## Max. :0.64 Max. :3.742 Max. :0.9000 Max. :0.9000
## NA's :63805
## ps_calc_03 ps_calc_04 ps_calc_05 ps_calc_06
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 1.000
## 1st Qu.:0.2000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 7.000
## Median :0.4000 Median :2.000 Median :2.000 Median : 8.000
## Mean :0.4501 Mean :2.371 Mean :1.885 Mean : 7.688
## 3rd Qu.:0.7000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000
## Max. :0.9000 Max. :5.000 Max. :6.000 Max. :10.000
##
## ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10
## Min. :0.00 Min. : 1.000 Min. :0.000 Min. : 0.000
## 1st Qu.:2.00 1st Qu.: 8.000 1st Qu.:1.000 1st Qu.: 6.000
## Median :3.00 Median : 9.000 Median :2.000 Median : 8.000
## Mean :3.01 Mean : 9.226 Mean :2.339 Mean : 8.443
## 3rd Qu.:4.00 3rd Qu.:10.000 3rd Qu.:3.000 3rd Qu.:10.000
## Max. :9.00 Max. :12.000 Max. :7.000 Max. :25.000
##
## ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 4.000 1st Qu.: 1.00 1st Qu.: 2.000 1st Qu.: 6.00
## Median : 5.000 Median : 1.00 Median : 3.000 Median : 7.00
## Mean : 5.438 Mean : 1.44 Mean : 2.875 Mean : 7.54
## 3rd Qu.: 7.000 3rd Qu.: 2.00 3rd Qu.: 4.000 3rd Qu.: 9.00
## Max. :20.000 Max. :11.00 Max. :15.000 Max. :28.00
##
## ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.1237 Mean :0.6278 Mean :0.5547 Mean :0.2878
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## ps_calc_19_bin ps_calc_20_bin
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.3493 Mean :0.1524
## 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000
##
glimpse(test)
## Observations: 892,816
## Variables: 58
## $ id <int> 0, 1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 14, 15, 18,...
## $ ps_ind_01 <int> 0, 4, 5, 0, 5, 0, 0, 0, 0, 1, 0, 1, 1, 3, 0, 2,...
## $ ps_ind_02_cat <int> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,...
## $ ps_ind_03 <int> 8, 5, 3, 6, 7, 6, 3, 0, 7, 6, 5, 4, 2, 3, 1, 2,...
## $ ps_ind_04_cat <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_ind_05_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 6, 0,...
## $ ps_ind_06_bin <int> 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0,...
## $ ps_ind_07_bin <int> 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0,...
## $ ps_ind_08_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ ps_ind_09_bin <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_10_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_11_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_12_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_13_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_14 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_15 <int> 12, 5, 10, 4, 4, 10, 11, 7, 6, 7, 3, 9, 8, 0, 8...
## $ ps_ind_16_bin <int> 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,...
## $ ps_ind_17_bin <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ ps_ind_18_bin <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ ps_reg_01 <dbl> 0.5, 0.9, 0.4, 0.1, 0.9, 0.9, 0.1, 0.9, 0.4, 0....
## $ ps_reg_02 <dbl> 0.3, 0.5, 0.0, 0.2, 0.4, 0.5, 0.1, 1.1, 0.0, 1....
## $ ps_reg_03 <dbl> 0.6103278, 0.7713624, 0.9161741, NA, 0.8177714,...
## $ ps_car_01_cat <int> 7, 4, 11, 7, 11, 9, 6, 7, 11, 11, 11, 11, 11, 1...
## $ ps_car_02_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,...
## $ ps_car_03_cat <int> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, ...
## $ ps_car_04_cat <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2,...
## $ ps_car_05_cat <int> NA, 0, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 1,...
## $ ps_car_06_cat <int> 1, 11, 14, 1, 11, 11, 1, 11, 2, 4, 11, 7, 6, 1,...
## $ ps_car_07_cat <int> 1, 1, 1, 1, 1, 0, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1...
## $ ps_car_08_cat <int> 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1,...
## $ ps_car_09_cat <int> 2, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 1, 0, 2,...
## $ ps_car_10_cat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ps_car_11_cat <int> 65, 103, 29, 40, 101, 11, 10, 103, 104, 104, 10...
## $ ps_car_11 <int> 1, 1, 3, 2, 3, 2, 2, 3, 2, 2, 3, 3, 2, 1, 2, 0,...
## $ ps_car_12 <dbl> 0.3162278, 0.3162278, 0.4000000, 0.3741657, 0.3...
## $ ps_car_13 <dbl> 0.6695564, 0.6063200, 0.8962387, 0.6521104, 0.8...
## $ ps_car_14 <dbl> 0.3521363, 0.3583295, 0.3984972, 0.3814446, 0.3...
## $ ps_car_15 <dbl> 3.464102, 2.828427, 3.316625, 2.449490, 3.31662...
## $ ps_calc_01 <dbl> 0.1, 0.4, 0.6, 0.1, 0.9, 0.7, 0.9, 0.8, 0.9, 0....
## $ ps_calc_02 <dbl> 0.8, 0.5, 0.6, 0.5, 0.6, 0.9, 0.8, 0.9, 0.3, 0....
## $ ps_calc_03 <dbl> 0.6, 0.4, 0.6, 0.5, 0.8, 0.4, 0.8, 0.5, 0.0, 0....
## $ ps_calc_04 <int> 1, 3, 2, 2, 3, 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 1,...
## $ ps_calc_05 <int> 1, 3, 3, 1, 4, 1, 1, 2, 2, 1, 2, 2, 1, 2, 3, 4,...
## $ ps_calc_06 <int> 6, 8, 7, 7, 7, 9, 7, 8, 9, 7, 5, 9, 8, 7, 7, 8,...
## $ ps_calc_07 <int> 3, 4, 4, 3, 1, 5, 3, 4, 7, 1, 5, 3, 0, 1, 3, 3,...
## $ ps_calc_08 <int> 6, 10, 6, 12, 10, 9, 9, 11, 9, 9, 7, 10, 9, 9, ...
## $ ps_calc_09 <int> 2, 2, 3, 1, 4, 4, 5, 2, 0, 1, 2, 1, 4, 1, 4, 3,...
## $ ps_calc_10 <int> 9, 7, 12, 13, 12, 12, 6, 8, 10, 11, 10, 4, 7, 1...
## $ ps_calc_11 <int> 1, 2, 4, 5, 4, 8, 2, 3, 5, 6, 2, 4, 3, 7, 6, 9,...
## $ ps_calc_12 <int> 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 2, 4, 0, 3, 0,...
## $ ps_calc_13 <int> 1, 3, 2, 0, 0, 4, 4, 4, 4, 6, 1, 7, 0, 5, 5, 2,...
## $ ps_calc_14 <int> 12, 10, 4, 5, 4, 9, 6, 9, 6, 10, 8, 8, 12, 9, 6...
## $ ps_calc_15_bin <int> 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_calc_16_bin <int> 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0,...
## $ ps_calc_17_bin <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,...
## $ ps_calc_18_bin <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,...
## $ ps_calc_19_bin <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ ps_calc_20_bin <int> 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
We find:
sum(is.na(train))
## [1] 846458
sum(is.na(test))
## [1] 1270295
There are of the order of 1 million missing values per data set. This is not trivial.
We will turn the categorical features into factors and the binary ones into logical values. For the target variable we choose a factor format.
train <- train %>%
mutate_at(vars(ends_with("cat")), funs(factor)) %>%
mutate_at(vars(ends_with("bin")), funs(as.logical)) %>%
mutate(target = as.factor(target))
test <- test %>%
mutate_at(vars(ends_with("cat")), funs(factor)) %>%
mutate_at(vars(ends_with("bin")), funs(as.logical))
Here combine the train and test data frames for future homogeneous treatment.
combine <- bind_rows(train %>% mutate(dset = "train"),
test %>% mutate(dset = "test",
target = NA))
combine <- combine %>% mutate(dset = factor(dset))
We start our exploration with overview distribution plots for the various features. In order to make this visualisation more comprehensive, we will create layouts for the specific groups of features. For the sake of readability we divide each group into multiple parts.
These plots will be one of the pillars of our analysis. They might not be particularly exciting in themselves, but whenever we find an interesting effect in one of the variables we can come back here and examine their distribution. It’s always an advantage to start with a clear view of the parameter space.
p1 <- train %>%
ggplot(aes(ps_ind_06_bin, fill = ps_ind_06_bin)) +
geom_bar() +
theme(legend.position = "none")
p2 <- train %>%
ggplot(aes(ps_ind_07_bin, fill = ps_ind_07_bin)) +
geom_bar() +
theme(legend.position = "none")
p3 <- train %>%
ggplot(aes(ps_ind_08_bin, fill = ps_ind_08_bin)) +
geom_bar() +
theme(legend.position = "none")
p4 <- train %>%
ggplot(aes(ps_ind_09_bin, fill = ps_ind_09_bin)) +
geom_bar() +
theme(legend.position = "none")
p5 <- train %>%
ggplot(aes(ps_ind_10_bin, fill = ps_ind_10_bin)) +
geom_bar() +
theme(legend.position = "none")
p6 <- train %>%
ggplot(aes(ps_ind_11_bin, fill = ps_ind_11_bin)) +
geom_bar() +
theme(legend.position = "none")
p7 <- train %>%
ggplot(aes(ps_ind_12_bin, fill = ps_ind_12_bin)) +
geom_bar() +
theme(legend.position = "none")
p8 <- train %>%
ggplot(aes(ps_ind_13_bin, fill = ps_ind_13_bin)) +
geom_bar() +
theme(legend.position = "none")
layout <- matrix(c(1,2,3,4,5,6,7,8),2,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, layout=layout)
Fig. 1
We find that some of the binary features are very unbalanced; with “FALSE” accounting for the vast majority of cases. This is particularly true for the ps_ind sequence from “10” to “13”.
p1 <- train %>%
ggplot(aes(ps_ind_16_bin, fill = ps_ind_16_bin)) +
geom_bar() +
theme(legend.position = "none")
p2 <- train %>%
ggplot(aes(ps_ind_17_bin, fill = ps_ind_17_bin)) +
geom_bar() +
theme(legend.position = "none")
p3 <- train %>%
ggplot(aes(ps_ind_18_bin, fill = ps_ind_18_bin)) +
geom_bar() +
theme(legend.position = "none")
p4 <- train %>%
ggplot(aes(ps_calc_15_bin, fill = ps_calc_15_bin)) +
geom_bar() +
theme(legend.position = "none")
p5 <- train %>%
ggplot(aes(ps_calc_16_bin, fill = ps_calc_16_bin)) +
geom_bar() +
theme(legend.position = "none")
p6 <- train %>%
ggplot(aes(ps_calc_17_bin, fill = ps_calc_17_bin)) +
geom_bar() +
theme(legend.position = "none")
p7 <- train %>%
ggplot(aes(ps_calc_18_bin, fill = ps_calc_18_bin)) +
geom_bar() +
theme(legend.position = "none")
p8 <- train %>%
ggplot(aes(ps_calc_19_bin, fill = ps_calc_19_bin)) +
geom_bar() +
theme(legend.position = "none")
p9 <- train %>%
ggplot(aes(ps_calc_20_bin, fill = ps_calc_20_bin)) +
geom_bar() +
theme(legend.position = "none")
layout <- matrix(c(1,2,3,4,5,6,7,8,9,9),2,5,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, layout=layout)
Fig. 2
We find that in this particular set of binary features we have more of a balance between “TRUE” and “FALSE”. For the three features ps_ind_16_bin, ps_calc_16_bin, and ps_calc_17_bin we find that the “TRUE” values are in fact dominating.
Note the logarithmic y-axes:
p1 <- train %>%
ggplot(aes(ps_ind_02_cat, fill = ps_ind_02_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p2 <- train %>%
ggplot(aes(ps_ind_04_cat, fill = ps_ind_04_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p3 <- train %>%
ggplot(aes(ps_ind_05_cat, fill = ps_ind_05_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p4 <- train %>%
ggplot(aes(ps_car_01_cat, fill = ps_car_01_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p5 <- train %>%
ggplot(aes(ps_car_02_cat, fill = ps_car_02_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p6 <- train %>%
ggplot(aes(ps_car_03_cat, fill = ps_car_03_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
layout <- matrix(c(1,2,3,4,5,6),3,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, layout=layout)
Fig. 3
We find that some categorical features have only very few levels, down to 2 levels (+ NA) for three of them. In others we have up to 11 levels, some of which are clearly dominating the (logarithmic) plots.
Note again the logarithmic y-axes:
p1 <- train %>%
ggplot(aes(ps_car_04_cat, fill = ps_car_04_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p2 <- train %>%
ggplot(aes(ps_car_05_cat, fill = ps_car_05_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p3 <- train %>%
ggplot(aes(ps_car_06_cat, fill = ps_car_06_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p4 <- train %>%
ggplot(aes(ps_car_07_cat, fill = ps_car_07_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p5 <- train %>%
ggplot(aes(ps_car_08_cat, fill = ps_car_08_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p6 <- train %>%
ggplot(aes(ps_car_09_cat, fill = ps_car_09_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p7 <- train %>%
ggplot(aes(ps_car_10_cat, fill = ps_car_10_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p8 <- train %>%
ggplot(aes(ps_car_11_cat, fill = ps_car_11_cat)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
layout <- matrix(c(1,1,2,3,4,4,5,5,6,6,7,7,8,8,8,8),4,4,byrow=TRUE)
multiplot(p1, p2, p4, p3, p5, p6, p7, p8, layout=layout)
Fig. 4
We find that also here the number of levels is mostly low. Recall that the car features are related to the automobile itself. Feature “11” has lots of levels. This is the one number shared by a (supposedly) categorical and an integer feature. Maybe there has been a mixup in naming these features?
The integer features for the “ind” and “car” groups are best visualised in a categorical-style barplot, because their ranges are not very large. We are using log axes for some.
p1 <- train %>%
mutate(ps_ind_01 = as.factor(ps_ind_01)) %>%
ggplot(aes(ps_ind_01, fill = ps_ind_01)) +
geom_bar() +
theme(legend.position = "none")
p2 <- train %>%
mutate(ps_ind_03 = as.factor(ps_ind_03)) %>%
ggplot(aes(ps_ind_03, fill = ps_ind_03)) +
geom_bar() +
theme(legend.position = "none")
p3 <- train %>%
mutate(ps_ind_14 = as.factor(ps_ind_14)) %>%
ggplot(aes(ps_ind_14, fill = ps_ind_14)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none")
p4 <- train %>%
mutate(ps_ind_15 = as.factor(ps_ind_15)) %>%
ggplot(aes(ps_ind_15, fill = ps_ind_15)) +
geom_bar() +
theme(legend.position = "none")
p5 <- train %>%
mutate(ps_car_11 = as.factor(ps_car_11)) %>%
ggplot(aes(ps_car_11, fill = ps_car_11)) +
geom_bar() +
theme(legend.position = "none")
layout <- matrix(c(1,1,2,2,3,4,4,5),2,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, layout=layout)
Fig. 5
We find that again there are large differences in frequencies, in particular for ps_ind_14 and ps_car_11 where “0” and “3” are the dominating values, respectively.
Whereas most of the “calc” integer features can still be visualised best using barplots, for three of them a histogram is a better choice:
p1 <- train %>%
mutate(ps_calc_04 = as.factor(ps_calc_04)) %>%
ggplot(aes(ps_calc_04, fill = ps_calc_04)) +
geom_bar() +
theme(legend.position = "none")
p2 <- train %>%
mutate(ps_calc_05 = as.factor(ps_calc_05)) %>%
ggplot(aes(ps_calc_05, fill = ps_calc_05)) +
geom_bar() +
theme(legend.position = "none")
p3 <- train %>%
mutate(ps_calc_06 = as.factor(ps_calc_06)) %>%
ggplot(aes(ps_calc_06, fill = ps_calc_06)) +
geom_bar() +
theme(legend.position = "none")
p4 <- train %>%
mutate(ps_calc_07 = as.factor(ps_calc_07)) %>%
ggplot(aes(ps_calc_07, fill = ps_calc_07)) +
geom_bar() +
theme(legend.position = "none")
p5 <- train %>%
mutate(ps_calc_08 = as.factor(ps_calc_08)) %>%
ggplot(aes(ps_calc_08, fill = ps_calc_08)) +
geom_bar() +
theme(legend.position = "none")
p6 <- train %>%
mutate(ps_calc_09 = as.factor(ps_calc_09)) %>%
ggplot(aes(ps_calc_09, fill = ps_calc_09)) +
geom_bar() +
theme(legend.position = "none")
p7 <- train %>%
ggplot(aes(ps_calc_10, fill = ps_calc_10)) +
geom_histogram(fill = "blue", binwidth = 1) +
theme(legend.position = "none")
p8 <- train %>%
ggplot(aes(ps_calc_11, fill = ps_calc_11)) +
geom_histogram(fill = "blue", binwidth = 1) +
theme(legend.position = "none")
p9 <- train %>%
mutate(ps_calc_12 = as.factor(ps_calc_12)) %>%
ggplot(aes(ps_calc_12, fill = ps_calc_12)) +
geom_bar() +
theme(legend.position = "none")
p10 <- train %>%
mutate(ps_calc_13 = as.factor(ps_calc_13)) %>%
ggplot(aes(ps_calc_13, fill = ps_calc_13)) +
geom_bar() +
theme(legend.position = "none")
p11 <- train %>%
ggplot(aes(ps_calc_14, fill = ps_calc_14)) +
geom_histogram(fill = "blue", binwidth = 1) +
theme(legend.position = "none")
layout <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,11),3,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, layout=layout)
Fig. 6
We find that the histogram features “10”, “11”, and “14” have close to normal looking distributions with possibly more pronounced tails towards larger values. The other features are not far from a normal or log-normal distribution either and consequently display significant ranges in frequency.
For the floating point features we choose histograms to get a first impression of their distributions:
p1 <- train %>%
ggplot(aes(ps_reg_01, fill = ps_reg_01)) +
geom_histogram(fill = "dark green", binwidth = 0.1) +
theme(legend.position = "none")
p2 <- train %>%
ggplot(aes(ps_reg_02, fill = ps_reg_02)) +
geom_histogram(fill = "dark green", binwidth = 0.1) +
theme(legend.position = "none")
p3 <- train %>%
ggplot(aes(ps_reg_03, fill = ps_reg_03)) +
geom_histogram(fill = "dark green", binwidth = 0.1) +
theme(legend.position = "none")
p4 <- train %>%
ggplot(aes(ps_calc_01, fill = ps_calc_01)) +
geom_histogram(fill = "blue", binwidth = 0.1) +
theme(legend.position = "none")
p5 <- train %>%
ggplot(aes(ps_calc_02, fill = ps_calc_02)) +
geom_histogram(fill = "blue", binwidth = 0.1) +
theme(legend.position = "none")
p6 <- train %>%
ggplot(aes(ps_calc_03, fill = ps_calc_03)) +
geom_histogram(fill = "blue", binwidth = 0.1) +
theme(legend.position = "none")
layout <- matrix(c(1,2,3,4,5,6),2,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, layout=layout)
Fig. 7
We find that while the (green) “reg” features show distributions that are clearly skewed toward a prominent peak, the (blue) “calc” features appear to be pretty uniformly distributed.
Also the second part of these features will be visualised using histograms:
p1 <- train %>%
ggplot(aes(ps_car_12, fill = ps_car_12)) +
geom_histogram(fill = "red", binwidth = 0.05) +
theme(legend.position = "none")
p2 <- train %>%
ggplot(aes(ps_car_13, fill = ps_car_13)) +
geom_histogram(fill = "red", binwidth = 0.1) +
theme(legend.position = "none")
p3 <- train %>%
ggplot(aes(ps_car_14, fill = ps_car_14)) +
geom_histogram(fill = "red", binwidth = 0.01) +
theme(legend.position = "none")
p4 <- train %>%
ggplot(aes(ps_car_15, fill = ps_car_15)) +
geom_histogram(fill = "red", binwidth = 0.1) +
theme(legend.position = "none")
layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 8
We find that the two features on the left show interesting sub-structure in their distributions, while ps_car_15 appears to take only quite distinct values until after ps_car_15 == 3when the gaps decrease notably.
Finally, this is what it is all about: whether a claim has been filed (“1”) or not (“0”):
train %>%
ggplot(aes(target, fill = target)) +
geom_bar() +
theme(legend.position = "none")
Fig. 9
We find:
train %>%
group_by(target) %>%
summarise(percentage = n()/nrow(train)*100)
## # A tibble: 2 x 2
## target percentage
## <fctr> <dbl>
## 1 0 96.355248
## 2 1 3.644752
With less than 4% of policy holders filing a claim the problem is heavily imbalanced.
Before proceeding with our analysis we will briefly examine the different combinations of missing values for the individual features in the training data. In this plot, the frequency of NAs per feature is shown in the bar plot at the top. The main part of the plot shows all the combinations where NAs occur in the same row for more than 1 feature. Thus, the more red rectangles a feature has the more features it shares NAs with.
train %>%
select(which(colMeans(is.na(.)) > 0)) %>%
aggr(prop = FALSE, combined = TRUE, numbers = TRUE, bars = FALSE, cex.axis = 0.7)
Fig. 9
We find:
The features ps_car_03_cat and ps_car_05_cat have the largest number of NAs. They also share numerous instances where NAs occur in both of them for the same row.
There are features that share a lot of NA rows with other features, for instance ps_reg_03. Others are exclusive, like ps_car_12, or almost exclusive like ps_car_11 or *ps_car_02.cat.
About 2.5% of values are missing in total in eacho of the train and test data sets:
sum(is.na(train))/(nrow(train)*ncol(train))*100
## [1] 2.410359
sum(is.na(test))/(nrow(test)*ncol(test))*100
## [1] 2.453096
In order to determine the importance of the individual parameters we will study the distribution of their claim rates, i.e. how large a fraction per categorical variable ended up claiming or how the distributions of claimed vs unclaimed compare. This analysis will follow a similar structure as above.
Here we estimate error bars from Binomial 95% confidence limits based on the count statistics of the claim vs no claim cases. These will help us to decide whether potential differences are significant or not. The corresponding helper function get_binCI is defined in Section 2.2 above.
p1 <- train %>%
group_by(ps_ind_06_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_06_bin, frac_claim, fill = ps_ind_06_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p2 <- train %>%
group_by(ps_ind_07_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_07_bin, frac_claim, fill = ps_ind_07_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p3 <- train %>%
group_by(ps_ind_08_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_08_bin, frac_claim, fill = ps_ind_08_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p4 <- train %>%
group_by(ps_ind_09_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_09_bin, frac_claim, fill = ps_ind_09_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p5 <- train %>%
group_by(ps_ind_10_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_10_bin, frac_claim, fill = ps_ind_10_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p6 <- train %>%
group_by(ps_ind_11_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_11_bin, frac_claim, fill = ps_ind_11_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p7 <- train %>%
group_by(ps_ind_12_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_12_bin, frac_claim, fill = ps_ind_12_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p8 <- train %>%
group_by(ps_ind_13_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_13_bin, frac_claim, fill = ps_ind_13_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
layout <- matrix(c(1,2,3,4,5,6,7,8),2,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, layout=layout)
Fig. 10
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1
We find that the first batch of binary features has significant differences in the claim fractions especially for the 4 top panels. There are differences in the bottom panels but they are much less significant.
p1 <- train %>%
group_by(ps_ind_16_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_16_bin, frac_claim, fill = ps_ind_16_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p2 <- train %>%
group_by(ps_ind_17_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_17_bin, frac_claim, fill = ps_ind_17_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p3 <- train %>%
group_by(ps_ind_18_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_18_bin, frac_claim, fill = ps_ind_18_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p4 <- train %>%
group_by(ps_calc_15_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_15_bin, frac_claim, fill = ps_calc_15_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p5 <- train %>%
group_by(ps_calc_16_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_16_bin, frac_claim, fill = ps_calc_16_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p6 <- train %>%
group_by(ps_calc_17_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_17_bin, frac_claim, fill = ps_calc_17_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p7 <- train %>%
group_by(ps_calc_18_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_18_bin, frac_claim, fill = ps_calc_18_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p8 <- train %>%
group_by(ps_calc_19_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_19_bin, frac_claim, fill = ps_calc_19_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
p9 <- train %>%
group_by(ps_calc_20_bin, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_20_bin, frac_claim, fill = ps_calc_20_bin)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(y = "Claims [%]")
layout <- matrix(c(1,2,3,4,5,6,7,8,9,9),2,5,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, layout=layout)
Fig. 11
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1
We find that the 2nd part of the binary features have much smaller, statistically insignificant differences in claim rate on average. The only exceptions are ps_ind_16_bin and ps_ind_17_bin which are significant.
Note that the features are reordered by claims fraction; except for NA, which is always at the end:
p1 <- train %>%
group_by(ps_ind_02_cat, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_ind_02_cat, -frac_claim, FUN = max), frac_claim, fill = ps_ind_02_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_ind_02_cat", y = "Claims [%]")
p2 <- train %>%
group_by(ps_ind_04_cat, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_ind_04_cat, -frac_claim, FUN = max), frac_claim, fill = ps_ind_04_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_ind_04_cat", y = "Claims [%]")
p3 <- train %>%
group_by(ps_ind_05_cat, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_ind_05_cat, -frac_claim, FUN = max), frac_claim, fill = ps_ind_05_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_ind_05_cat", y = "Claims [%]")
p4 <- train %>%
group_by(ps_car_01_cat, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_car_01_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_01_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_car_01_cat", y = "Claims [%]")
p5 <- train %>%
group_by(ps_car_02_cat, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_car_02_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_02_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_car_02_cat", y = "Claims [%]")
p6 <- train %>%
group_by(ps_car_03_cat, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_car_03_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_03_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_car_03_cat", y = "Claims [%]")
layout <- matrix(c(1,2,3,4,5,6),3,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, layout=layout)
Fig. 12
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1
We find that interestingly enough there appears to be a correlation with the claims rate and whether a certain feature is existing; as evidenced by the high fractions among the NAs for most features. Besides this strong effect there might by a more suble dependence on ps_ind_05_cat especially in “2” vs “0”.
p1 <- train %>%
group_by(ps_car_04_cat, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_car_04_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_04_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_car_04_cat", y = "Claims [%]")
p2 <- train %>%
group_by(ps_car_05_cat, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_car_05_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_05_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_car_05_cat", y = "Claims [%]")
p3 <- train %>%
group_by(ps_car_06_cat, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_car_06_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_06_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_car_06_cat", y = "Claims [%]")
p4 <- train %>%
group_by(ps_car_07_cat, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_car_07_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_07_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_car_07_cat", y = "Claims [%]")
p5 <- train %>%
group_by(ps_car_08_cat, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_car_08_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_08_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_car_08_cat", y = "Claims [%]")
p6 <- train %>%
group_by(ps_car_09_cat, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_car_09_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_09_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_car_09_cat", y = "Claims [%]")
p7 <- train %>%
group_by(ps_car_10_cat, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_car_10_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_10_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_car_10_cat", y = "Claims [%]")
p8 <- train %>%
group_by(ps_car_11_cat, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(reorder(ps_car_11_cat, -frac_claim, FUN = max), frac_claim, fill = ps_car_11_cat)) +
geom_col() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "gray30") +
theme(legend.position = "none") +
labs(x = "ps_car_11_cat", y = "Claims [%]")
layout <- matrix(c(1,1,2,3,4,4,5,5,6,6,7,7,8,8,8,8),4,4,byrow=TRUE)
multiplot(p1, p2, p4, p3, p5, p6, p7, p8, layout=layout)
Fig. 12
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1
We find that the second batch of categorical features also shows a few interesting features, such as ps_car_06_cat whereas other (like “10”) don’t seem to be related to the claims rate at all. We notice again that “11_cat” has far more levels than any other categorical feature
For the integer features we will now use a scatter plot with (95% confidence) error bars to emphasise the natural ordering of the values:
p1 <- train %>%
group_by(ps_ind_01, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_01, frac_claim)) +
geom_point(color = "orange") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
theme(legend.position = "none") +
labs(x = "ps_ind_01", y = "Claims [%]")
p2 <- train %>%
group_by(ps_ind_03, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_03, frac_claim)) +
geom_point(color = "orange") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
theme(legend.position = "none") +
labs(x = "ps_ind_03", y = "Claims [%]")
p3 <- train %>%
group_by(ps_ind_14, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_14, frac_claim)) +
geom_point(color = "orange") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
theme(legend.position = "none") +
labs(x = "ps_ind_14", y = "Claims [%]")
p4 <- train %>%
group_by(ps_ind_15, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_ind_15, frac_claim)) +
geom_point(color = "orange") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
theme(legend.position = "none") +
labs(x = "ps_ind_15", y = "Claims [%]")
p5 <- train %>%
filter(!is.na(ps_car_11)) %>%
group_by(ps_car_11, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_car_11, frac_claim)) +
geom_point(color = "red") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "red") +
theme(legend.position = "none") +
labs(x = "ps_car_11", y = "Claims [%]")
layout <- matrix(c(1,1,2,2,3,4,4,5),2,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, layout=layout)
Fig. 13
We find:
These variables show much more significant signal in their claim rates in the range of 2-3 percentage points.
ps_ind_03 and ps_car_11 have notable drops in claim number from their maxima which are both at zero.
The claims fraction for ps_ind_01 has an increasing trend, wherease ps_ind_15 is almost monotonically decreasing.
Note, that here we removed the NAs from the ps_car_11 feature, because they are only 5 values (with zero claims) and would make the plot much harder to read.
For features with a larger range of values we’re switching to overlapping density plots. Here we have removed a few values with low counts and very large error bars for about half of the features:
p1 <- train %>%
group_by(ps_calc_04, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_04, frac_claim)) +
geom_point(color = "blue") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
theme(legend.position = "none") +
labs(x = "ps_calc_04", y = "Claims [%]")
p2 <- train %>%
filter(ps_calc_05 < 6) %>%
group_by(ps_calc_05, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_05, frac_claim)) +
geom_point(color = "blue") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
theme(legend.position = "none") +
labs(x = "ps_calc_05", y = "Claims [%]")
p3 <- train %>%
filter(ps_calc_06 > 2) %>%
group_by(ps_calc_06, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_06, frac_claim)) +
geom_point(color = "blue") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
theme(legend.position = "none") +
labs(x = "ps_calc_06", y = "Claims [%]")
p4 <- train %>%
filter(ps_calc_07 < 8) %>%
group_by(ps_calc_07, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_07, frac_claim)) +
geom_point(color = "blue") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
theme(legend.position = "none") +
labs(x = "ps_calc_07", y = "Claims [%]")
p5 <- train %>%
filter(ps_calc_08 > 2) %>%
group_by(ps_calc_08, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_08, frac_claim)) +
geom_point(color = "blue") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
theme(legend.position = "none") +
labs(x = "ps_calc_08", y = "Claims [%]")
p6 <- train %>%
group_by(ps_calc_09, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_09, frac_claim)) +
geom_point(color = "blue") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
theme(legend.position = "none") +
labs(x = "ps_calc_09", y = "Claims [%]")
p7 <- train %>%
ggplot(aes(ps_calc_10, fill = target)) +
geom_density(alpha = 0.5, bw = 0.4) +
theme(legend.position = "none")
p8 <- train %>%
ggplot(aes(ps_calc_11, fill = target)) +
geom_density(alpha = 0.5, bw = 0.4) +
theme(legend.position = "none")
p9 <- train %>%
filter(ps_calc_12 < 9) %>%
group_by(ps_calc_12, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_12, frac_claim)) +
geom_point(color = "blue") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
theme(legend.position = "none") +
labs(x = "ps_calc_12", y = "Claims [%]")
p10 <- train %>%
filter(ps_calc_13 < 12) %>%
group_by(ps_calc_13, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ggplot(aes(ps_calc_13, frac_claim)) +
geom_point(color = "blue") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
theme(legend.position = "none") +
labs(x = "ps_calc_13", y = "Claims [%]")
p11 <- train %>%
ggplot(aes(ps_calc_14, fill = target)) +
geom_density(alpha = 0.5, bw = 0.4)
layout <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,11),3,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, layout=layout)
Fig. 14
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1
We find:
There is no statistically significant signal in the scatter plots. All error bars overlap in each plot and the slight wiggles that we see might very well just be random variation.
The density plots show no difference either. Each plot uses two overlapping colours with alpha blending, as shown in the legend on the bottom right, but you only see the blended result because the overlap is practically perfect.
We examine the floating point features using overlapping density plots only:
p1 <- train %>%
ggplot(aes(ps_reg_01, fill = target)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p2 <- train %>%
ggplot(aes(ps_reg_02, fill = target)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p3 <- train %>%
ggplot(aes(ps_reg_03, fill = target)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p4 <- train %>%
ggplot(aes(ps_calc_01, fill = target)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p5 <- train %>%
ggplot(aes(ps_calc_02, fill = target)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p6 <- train %>%
ggplot(aes(ps_calc_03, fill = target)) +
geom_density(alpha = 0.5, bw = 0.05)
layout <- matrix(c(1,2,3,4,5,6),2,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, layout=layout)
Fig. 15
We find:
There are small but notable differences in the reg features, which are related to the region of the policy holder. Lower region values appear to show a lower number of claims on average.
The essentially uniform ps_calc_01 - ps_calc_03 features show almost perfect overlap. There might be some small differences but they could well be random.
p1 <- train %>%
ggplot(aes(ps_car_12, fill = target)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p2 <- train %>%
ggplot(aes(ps_car_13, fill = target)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p3 <- train %>%
ggplot(aes(ps_car_14, fill = target)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p4 <- train %>%
ggplot(aes(ps_car_15, fill = target)) +
geom_density(alpha = 0.5, bw = 0.1) +
theme(legend.position = "none")
layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 16
We find that the float car features also show small deviations towards higher numbers having more claims. While the effects might well be statistically significant their practical impact could be small.
In terms of the impact of the target on the individual features we clearly see strong differences in claim rates within and between the different groups. Some of the binary features show significant effects in the range of 1-2 percentage points, whereas others are show no practical impact. Specifically, most “ind” features have a clear influence claims whereas the calc binary features are neutral. The same appears to be true for the calc integer and floating point features; suggesting that the calc group in general is not of immediate usefulness for our prediction goal.
The strongest impact on claim rates is shown by the categorical and integer features; in particular the “ind” and “car” variables which can vary in the range of 2-3 percentage points. The effect on the floating point features is much more subtle, but might prove useful in getting the best prediction out of this data.
After studying each feature individually we will now start to look at interactions between them. The annoymity of the features will make it more difficult to interpret these relations. However, they will still be useful for our prediction goal and for gaining a more detailed understanding of our data.
We begin with a correlation matrix plot as a first comprehensive overview of our multi-parameter space. We will then take this overview plot as a starting point to investigate specific multi-feature comparisons in the following. Those examinations will likely result in more questions, which we can also examine (to a certain extend) in the same step.
What we will see here is the correlation coefficients for each combination of two features. In simplest terms: this shows whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero the weaker is the correlation. Both 1 and -1 are the ideal cases of perfect correlation and anti-correlation.
Here we will only include those features that we found to be useful in the previous step. Thereby, our plot will be easier to read and more informative. Note that for the purpose of this plot we will recode our binary features and categorical as integers. All rows with NAs are excluded.
train %>%
select(-starts_with("ps_calc"), -ps_ind_10_bin, -ps_ind_11_bin, -ps_car_10_cat, -id) %>%
mutate_at(vars(ends_with("cat")), funs(as.integer)) %>%
mutate_at(vars(ends_with("bin")), funs(as.integer)) %>%
mutate(target = as.integer(target)) %>%
cor(use="complete.obs", method = "spearman") %>%
corrplot(type="lower", tl.col = "black", diag=FALSE)
Fig. 17
In this kind of plot we want to look for the bright, large circles which immediately show the strong correlations (size and shading depends on the absolute values of the coefficients; colour depends on direction). Anything that you would have to squint to see is usually not worth seeing.
We find:
Most features appear to be primarily correlated with others in their group. We can see this by studying the upper right region near where the diagonal would be and comparing it to the lower left area of the plot.
There is no obvious correlation with the target feature in the left-most column. This could be caused by the sparsity of the target == 1 values.
Here we plot only the moderately to highly correlated features by showing their correlation coefficients directly:
train %>%
select(ps_ind_12_bin, ps_ind_14, ps_ind_16_bin, ps_ind_17_bin, ps_ind_18_bin, ps_reg_02,
ps_reg_03, ps_car_12, ps_car_13, ps_car_14, ps_car_15, ps_car_02_cat, ps_car_04_cat) %>%
mutate_at(vars(ends_with("cat")), funs(as.integer)) %>%
mutate_at(vars(ends_with("bin")), funs(as.integer)) %>%
cor(use="complete.obs", method = "spearman") %>%
corrplot(type="lower", tl.col = "black", diag=FALSE, method = "number")
Fig. 18
We find:
There is a very strong correlation between ps_ind_12_bin and ps_ind_14, which is an ordinal integer feature with 5 levels. Other correlations that exist are weaker but still notable.
The correlation between the “reg” and “car” features, respectively, shows how continuous variables are related. In particular ps_car_14, which showed only a small effect in the individual plots, might be interesting here.
The anti-correlations are generally not as strong, with ps_ind_16_bin and ps_ind_18_bin accounting for the strongest coefficient with -0.58.
For a different kind of visualisation of multi-feature interaction we can use an alluvial plot. Made available through the alluvial package, those plots are a kind of mix between a flow chart and a bar plot and they show how the target categories relate to various discrete features. Here we pick a subset of the more strongly correlated features we had identified in the previous plot:
allu_train <- train %>%
filter(!is.na(ps_car_02_cat)) %>%
filter(ps_car_04_cat %in% c("0","1","2","8","9")) %>%
group_by(target, ps_ind_17_bin, ps_ind_18_bin, ps_car_02_cat, ps_car_04_cat) %>%
count() %>%
ungroup
alluvial(allu_train %>% select(-n),
freq=allu_train$n, border=NA,
col=ifelse(allu_train$target == 0, "red", "blue"),
cex=0.75,
hide = allu_train$n < 200,
ordering = list(
order(allu_train$target==1),
NULL,
NULL,
NULL,
NULL))
Fig. 19
In this plot, the vertical sizes of the blocks and the widths of the stripes (called “alluvia”) are proportional to the frequency. We decided to hide the alluvia with the lowest frequencies using the parameter of the same name. Within a category block you can re-order the vertical layering of the alluvia to emphasise certain aspects of your data set.
We see nicely how the small proportion of positive target values (on the upper left) contributes to the make up of the different features. The dominating binary feature values can clearly be seen to carry an approximately proportional amount of claims. Feel free to fork this plot to explore the composition of other features.
Let’s add a little pinch of interactivity to our EDA recipe. Below we are using the R wrapper for the plotly toolkit which allows us to create plots we can examine from different angles and zoom stages. Here we use it to visualise the covariation between the three ps_car features (12, 13, 14) and its impact on the (colour-coded) target variable. Note, that we’re only using a subset of the data to keep the plot at a manageable size:
set.seed(4321)
train %>%
filter(!is.na(ps_car_12) & !is.na(ps_car_13) & !is.na(ps_car_14)) %>%
select(ps_car_12, ps_car_13, ps_car_14, target) %>%
sample_n(5e4) %>%
plot_ly(x = ~ps_car_12, y = ~ps_car_13, z = ~ps_car_14,
color = ~target, colors = c('#BF382A', '#0C4B8E'),
text = ~paste('ps_car_12:', ps_car_12,
'<br>ps_car_13:', ps_car_13,
'<br>ps_car_14:', ps_car_14,
'<br>Target:', target)
) %>%
add_markers() %>%
layout(title = "'Car' group correlations and target impact",
scene = list(xaxis = list(title = 'ps_car_12'),
yaxis = list(title = 'ps_car_13'),
zaxis = list(title = 'ps_car_14')))
Fig. 21
We find that the correlations can clearly be seen and are leading to a characteristic plane in the 3D parameter space. The target == 1 variables appear to be homogeneously dotted throughout this plane. Have a go a exploring the data through this plot :-)
Now we are putting all the correlated pieces together in a comprehensive, facetted overview plot for the main parameters in this section. These parameters don’t show an obvious connection in the correlation plot, but it is worth looking for more subtle interactions here.
We are plotting the distribution of ps_car_14 (which is correlated with ps_car_12/13) for the 5 ps_ind_14 classes (correlated with ps_ind_12_bin) using ridgeline plots through ggridges. Ridgeline plots allow for a quick comparison of overlapping (density) curves.
The we separate those overlapping curves by ps_ind_16_bin (correlated with ps_ind_17_bin and ps_ind_18_bin) horizontally and finally by target vertically. This is the result:
train %>%
ggplot(aes(ps_car_14, reorder(ps_ind_14, - ps_car_14, FUN = median), fill = as.factor(ps_ind_14))) +
geom_density_ridges() +
labs(y = "ps_ind_14") +
theme(legend.position = "none", plot.title = element_text(size=11)) +
facet_grid(ps_ind_16_bin ~ target) +
ggtitle("Target (left/right) vs ps_ind_16_bin (top/bottom) for ps_car_14 (x) vs ps_ind_14 (y, col)")
Fig. 22
We find:
Reading from left to right and top to bottom the number of populated levels of ps_ind_14 (i.e. number of colours) steadily decreases. That means that target == 1 & ps_ind_16_bin == TRUE only contain ps_ind_14 <= 1. In general, target == 1 (i.e. claims) contains fewer occupied levels of ps_ind_14 than target == 0 (no claims).
The distributions of ps_car_14 show notable differences for different ps_ind_14 levels troughout the grid. In particular ps_ind_14 == 0 is associated with significantly larger ps_car_14 values.
The ps_ind_14 levels from 1-3 are much more similar but also show subtle differences particularly towards larger values of ps_car_14. We have already seen above that level 4 is only sparsely populated.
Once we have sufficiently explored the connection within the existing features our next step will be to use these insights to build new features, derived from interactions or other characteristics of the data. As in some of my other kernels, the new features will be defined in the following single code block and then studied in the sub-sections below. This gives us the advantage of having everything in one place instead of dotted accross this section. This code block will grow as we engineer new features.
There is of course one problem here: How to do this engineering from anonymised features? Not knowing what our variables represent makes it more difficult to interpret their connections. Fortunately, we still have a couple of tricks up our sleeve. Let’s see how useful they are :-)
We work on the combined data frame to make sure that our train and test data are treated consistently.
# number of NAs
nano <- combine %>%
is.na() %>%
rowSums() %>%
as.integer()
# Sum up "ind" binary columns
bin_ind <- combine %>% select(ends_with("bin")) %>%
select(starts_with("ps_ind")) %>%
rowSums() %>%
as.integer()
# Sum up "calc" binary columns
bin_calc <- combine %>% select(ends_with("bin")) %>%
select(starts_with("ps_calc")) %>%
rowSums() %>%
as.integer()
# "ind" binary column differences per row
# (uses "rep" to create a combine-size data frame from the
# single reference row that we then subtract from all rows)
bins <- combine %>%
select(ends_with("bin")) %>%
select(starts_with("ps_ind")) %>%
mutate_all(funs(as.integer))
ref_bin <- bins %>% summarise_all(median, na.rm = TRUE)
ref_bin <- ref_bin[rep(1,nrow(combine)),]
diff_ind <- rowSums(abs(bins - ref_bin))
# "calc" binary column differences per row
bins <- combine %>%
select(ends_with("bin")) %>%
select(starts_with("ps_calc")) %>%
mutate_all(funs(as.integer))
ref_bin <- bins %>% summarise_all(median, na.rm = TRUE)
ref_bin <- ref_bin[rep(1,nrow(combine)),]
diff_calc <- rowSums(abs(bins - ref_bin))
# Apply changes to combine frame
combine <- combine %>%
mutate(nano = nano,
bin_ind = bin_ind,
bin_calc = bin_calc,
diff_ind = diff_ind,
diff_calc = diff_calc)
# Split into train vs test
train <- combine %>%
filter(dset == "train")
test <- combine %>%
filter(dset == "test")
We start with the number of NAs for each policy holder ID. Sometimes it’s just as important to take into account how much we don’t know about a certain user than work with the information we have. Below, we examine the distribution of NAs per ID throughout all features (i.e. per row for all columns) and the corresponding claim rates.
Note the logarithmic y-axis in the histogram. For the claim rates, we make use of a facet_zoom, via the ggforce package, to examine the observations with few NAs in greater detail. The zoomed area (on the left) is highlighted in the full plot (on the right) in a darker shade of grey:
p1 <- train %>%
ggplot(aes(nano, fill = as.factor(nano))) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none") +
labs(x = "Number of NAs")
p2 <- train %>%
group_by(nano, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ungroup() %>%
ggplot(aes(nano, frac_claim)) +
geom_point(color = "red") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
theme(legend.position = "none") +
labs(x = "Number of NAs", y = "Claims [%]") +
facet_zoom(x = nano < 5, y = frac_claim < 10)
layout <- matrix(c(1,2),2,1,byrow=TRUE)
multiplot(p1, p2, layout=layout)
Fig. 24
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1
We find:
The number of NAs is predominantly low with most IDs having <= 2 NA entries (log axis). Interestingly, 2 NAs, not zero, is the most common number. Beyond that, the numbers drop quickly and there are only a few cases with more than 5 NAs. It is interesting to note that no ID has exactly 5 NAs, which might reflect the way our features are related. There is a small second peak at 7 NAs, which might also reveal some underlying structure in the data.
The claim rates show a very interesting pattern with a steady decline from 0 to 4 within the percentage range of 2-5%. For 6 and 7 NAs the rates are very high in comparison (means of 60% and 40%) and again there is a declining trend. For 8 NAs there are not enough cases to make a statistically significant statement.
The differences between 0-4 and 6-8 NAs suggests that we might have two distinct populations here which reflect differences in the policy holder’s characteristics. It will be well worth coming back to these distinctions during the course of our further analysis.
Next, we want to examine what happens if we sum up all the binary feature values of each group for the different IDs. There are two groups that have binary features: “ind” (the individual policy holder characteristics) and “calc” (calculated features). Here we show their distributions and claim rates:
p1 <- train %>%
ggplot(aes(bin_ind, fill = as.factor(bin_ind))) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none") +
labs(x = "Sum of binary 'ind' features")
p2 <- train %>%
filter(bin_ind < 6) %>%
group_by(bin_ind, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ungroup() %>%
ggplot(aes(bin_ind, frac_claim)) +
geom_point(color = "orange") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
theme(legend.position = "none") +
labs(x = "Sum of binary 'ind' features", y = "Claims [%]")
p3 <- train %>%
ggplot(aes(bin_calc, fill = as.factor(bin_calc))) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none") +
labs(x = "Sum of binary 'calc' features")
p4 <- train %>%
filter(bin_ind < 6) %>%
group_by(bin_calc, target) %>%
count() %>%
spread(target, n) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ungroup() %>%
ggplot(aes(bin_calc, frac_claim)) +
geom_point(color = "blue") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
theme(legend.position = "none") +
labs(x = "Sum of binary 'calc' features", y = "Claims [%]")
layout <- matrix(c(1,2,3,4),2,2,byrow=FALSE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 25
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1
We find:
Adding up the “ind” binary features results in a distribution which peaks at 2 and has a range of 1-6. There is no “0”, which means that at least one of the binary flags is always set. At the beginning of this kernel we saw that there are a total of 11 “ps_ind_??_bin” features but we only have a maximum of six set to “TRUE/1” at the same time. This is consistent with “FALSE/0” being the dominating value for most of these features.
For plotting the sum of binary “ind” features (bottom left) we see that there are only a few instances of 6 binary flags set (top left). This leads to a large statistical error bar that would dominate the plot and make it harder to read. Therefore, we remove binary sum == 6 from the bottom left plot. For the remaining sums there is a trend of increasing claim rate with increasing sum; but only “3” vs “1” or “2” is statistically significant. Still, this could be a useful new feature.
The “calc” binary features on the other hand display no significant variation in their claim rates; thus providing further evidence that the “calc” group might not be very useful for our prediction goal. Here the sum distribution also peaks at 2, but takes a minimum of 0 and also a maximum of 6 for in total 6 “calc” features. Toward larger sums the (log-scaled) distribution has less of a sharp decline than for the “ind” features.
After adding up the different binary values, our next feature will be a simple measure of how different certain binary observations are. We define a reference row, which corresponds to the median value of each binary feature. This definition was changed from the original one following constructive criticism by David J. Slate and Oscar Takeshita in the comments below. They were rightfully pointing out that this feature is not independent of the reference row.
We then subtract the binary values of this reference row from the set of binary values for each row. This determines all the bins that are different (i.e. are “0” where the reference row is “1” and vice versa). To properly count the differences we are summing up the absolute values of the resulting data frame, thereby adding a value of “-1” as “1” to our difference measure. Without the absolute values, a “-1” and a “1” would cancel out, but both indicate that there is a difference in the binary features and two differences should add up to “2”. This is the main difference between this feature and the binary sum feature above.
We do this for the “ind” and the “calc” binary features separately. Below are the resulting plots for the distribution and the claim rates for each distance measure. Note the square-root scaling in the y-axis of the bar plot:
p1 <- train %>%
ggplot(aes(as.factor(diff_ind), fill = as.factor(diff_ind))) +
geom_bar() +
scale_y_sqrt() +
labs(fill = "diff_ind", x = "Absolute difference of binary 'ind' features")
p2 <- train %>%
ggplot(aes(as.factor(diff_calc), fill = as.factor(diff_calc))) +
geom_bar() +
scale_y_sqrt() +
labs(fill = "diff_calc", x = "Absolute difference of binary 'calc' features")
p3 <- train %>%
filter(diff_ind < 7) %>%
group_by(diff_ind, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ungroup() %>%
ggplot(aes(diff_ind, frac_claim)) +
geom_point(color = "orange") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "orange") +
theme(legend.position = "none") +
labs(x = "Absolute difference of binary 'ind' features", y = "Claims [%]")
p4 <- train %>%
filter(diff_calc < 6) %>%
group_by(diff_calc, target) %>%
count() %>%
spread(target, n, fill = 0) %>%
mutate(frac_claim = `1`/(`1`+`0`)*100,
lwr = get_binCI(`1`,(`1`+`0`))[[1]]*100,
upr = get_binCI(`1`,(`1`+`0`))[[2]]*100
) %>%
ungroup() %>%
ggplot(aes(diff_calc, frac_claim)) +
geom_point(color = "blue") +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7, color = "blue") +
theme(legend.position = "none") +
labs(x = "Absolute difference of of binary 'calc' features", y = "Claims [%]")
layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 26
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1
We find:
The differences of “ind” binary values are dominated by values 1 and 3. Interestingly, there are no values of zero, indicating that our median reference row does not actually occur in the data. We’re seeing a pattern in the claim rates suggesting that the claim fraction rises the further away we step from our reference row values. However, only the differences between values 1 & 2 vs 3 & 4 appear to be statistically significant.
In the “calc” binary features we see that we still haven’t managed to find a useful signal. There might be a tentative decrease in claim rate between the values of 2 vs 3, but everything is consistent within the uncertainties. Therefore, it appears that there is very little predictive power in the “calc” group.
Before we start building our prediction model we want to run a few sanity checks and get our data in shape for the model fitting. These preparations are the subject of this section.
One thing we want to make sure is that the train and test data do actually cover similar parts of the parameter space. There is very little use in having a feature with a strong target correlation in our training data if those crucial feature levels are absent in the test data.
First, we’ll prepare the data and define a new helper function. The following code block does most of the (relatively) heavy lifting in this subsection. We will outline its various steps as we explore the results.
bin_col <- combine %>% select(ends_with("bin")) %>%
colnames() %>% unlist(use.names = FALSE) %>% as.character()
cat_col <- combine %>% select(ends_with("cat")) %>%
colnames() %>% unlist(use.names = FALSE) %>% as.character()
train_frac_bin <- NULL
for (i in bin_col){
foo <- combine %>%
group_by(!!sym(i), dset) %>%
count() %>%
spread(dset, n, fill = 0) %>%
mutate(frac_train = train/(train+test)*100,
lwr = get_binCI(train,(train+test))[[1]]*100,
upr = get_binCI(train,(train+test))[[2]]*100
)
train_frac_bin <- tibble(name = i,
value = c(FALSE, TRUE),
tfrac = c(foo$frac_train),
lwr = c(foo$lwr),
upr = c(foo$upr)) %>%
bind_rows(train_frac_bin)
}
plot_cat_train_test <- function(col){
col <- enquo(col)
combine %>%
group_by(!!col, dset) %>%
count() %>%
spread(dset, n, fill = 0) %>%
mutate(frac_train = train/(train+test)*100,
lwr = get_binCI(train,(train+test))[[1]]*100,
upr = get_binCI(train,(train+test))[[2]]*100,
col = !!col
) %>%
ggplot(aes(col, frac_train)) +
geom_point() +
geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0.5, size = 0.7) +
labs(x = as.character(col)[2], y = "")
}
We begin with a comparison of the train vs test data in two different aspects: First the distribution of id values and second a closer look at the binary features in which we plot the relative fraction of train observations for each level:
p1 <- combine %>%
ggplot(aes(id, fill = dset)) +
geom_density(bw = 1, alpha = 0.5) +
coord_cartesian(ylim = c(7.3e-4,8.3e-4))
p2 <- train_frac_bin %>%
ggplot(aes(name, tfrac, color = value)) +
geom_point() +
geom_errorbar(aes(ymin = lwr, ymax = upr, color = value), width = 0.5, size = 0.7) +
labs(x = "Binary 'ind' features", y = "Training set percentage") +
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=0.9))
layout <- matrix(c(1,2),2,1,byrow=TRUE)
multiplot(p1, p2, layout=layout)
Fig. 27
p1 <- 1; p2 <- 1
We find:
The id values appear to be pretty randomly distributed. There is no obvious id range where we find predominantly train or test data. We’re using a density plot here to copmare the relative fluctuations.
The binary feature levels are remarkably similar in terms of their train/(train+test) percentage. We see that on average the train data makes up about 40% of each feature and for each TRUE & FALSE level, respectively. Levels with larger deviation have only a few objects and therefore large errors. Everything is consistent within the uncertainties.
Next, we will have a look at the levels of the categorical features. Our approach will be the same as for the binary features, in that we compute the train/(train+test) percentage for each level and then plot everything on a comprehensive plot. This figure is rather busy and the important result is in the overall picture rather than the details:
#str_c("p",seq(1,length(cat_col))," <- plot_cat_train_test(",cat_col)
p1 <- plot_cat_train_test(ps_ind_02_cat)
p2 <- plot_cat_train_test(ps_ind_04_cat)
p3 <- plot_cat_train_test(ps_ind_05_cat)
p4 <- plot_cat_train_test(ps_car_01_cat)
p5 <- plot_cat_train_test(ps_car_02_cat)
p6 <- plot_cat_train_test(ps_car_03_cat)
p7 <- plot_cat_train_test(ps_car_04_cat)
p8 <- plot_cat_train_test(ps_car_05_cat)
p9 <- plot_cat_train_test(ps_car_06_cat)
p10 <- plot_cat_train_test(ps_car_07_cat)
p11 <- plot_cat_train_test(ps_car_08_cat)
p12 <- plot_cat_train_test(ps_car_09_cat)
p13 <- plot_cat_train_test(ps_car_10_cat)
p14 <- plot_cat_train_test(ps_car_11_cat)
layout <- matrix(seq(1,14),2,7,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, layout=layout)
Fig. 28
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1
p11 <- 1; p12 <- 1; p13 <- 1; p14 <- 1
We find:
Once more, practically all levels appear to be consistent with a 40% train percentage, which is even more remarkable than for the binary features.
There are a few tentative outliers, e.g. in ps_ind_05_cat, but for so many levels that’s almost expected for the 95% confidence uncertainties that we are plotting.
Finally, we will also sample some of the floating-point features. Here we use overlapping density plots for the train vs test observations:
p1 <- combine %>%
ggplot(aes(ps_reg_01, fill = dset)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p2 <- combine %>%
ggplot(aes(ps_reg_02, fill = dset)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p3 <- combine %>%
ggplot(aes(ps_reg_03, fill = dset)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p4 <- combine %>%
ggplot(aes(ps_calc_01, fill = dset)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p5 <- combine %>%
ggplot(aes(ps_calc_02, fill = dset)) +
geom_density(alpha = 0.5, bw = 0.05) +
theme(legend.position = "none")
p6 <- combine %>%
ggplot(aes(ps_calc_03, fill = dset)) +
geom_density(alpha = 0.5, bw = 0.05)
layout <- matrix(c(1,2,3,4,5,6),2,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, layout=layout)
Fig. 29
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; p7 <- 1; p8 <- 1; p9 <- 1; p10 <- 1; p11 <- 1
We find that for these features the overlap appears to be practically perfect. We’re using two different colours, with an alpha-mixing parameter of 0.5, but all we see is the blended overlap.
In summary, we find the train vs test data sets to be remarkably similar in the distributions of their features. The train/test split appears to be well executed and stratified by feature levels.
Not all features in our data set will be important. In particular, we have seen that the calc features appear to be of limited usefulness. Here we only include meaningful variables. The following code block, adapted from my recent Taxi competition kernel is designed to be easily modified to a new challenge:
# Feature formatting
combine <- combine %>%
mutate_at(vars(ends_with("cat")), funs(as.integer)) %>%
mutate_at(vars(ends_with("bin")), funs(as.integer)) %>%
mutate(target = as.integer(levels(target))[target])
# Specific definitions:
#---------------------------------
# predictor features
ind_cols <- c("ps_ind_01","ps_ind_02_cat","ps_ind_03","ps_ind_04_cat","ps_ind_05_cat","ps_ind_06_bin","ps_ind_07_bin","ps_ind_08_bin","ps_ind_09_bin","ps_ind_10_bin","ps_ind_11_bin","ps_ind_12_bin","ps_ind_13_bin","ps_ind_14","ps_ind_15","ps_ind_16_bin","ps_ind_17_bin","ps_ind_18_bin")
reg_cols <- c("ps_reg_01","ps_reg_02","ps_reg_03")
car_cols <- c("ps_car_01_cat","ps_car_02_cat","ps_car_03_cat","ps_car_04_cat","ps_car_05_cat","ps_car_06_cat","ps_car_07_cat","ps_car_08_cat","ps_car_09_cat","ps_car_10_cat","ps_car_11_cat","ps_car_11","ps_car_12","ps_car_13","ps_car_14","ps_car_15")
calc_cols <- c("ps_calc_01","ps_calc_02","ps_calc_03","ps_calc_04","ps_calc_05","ps_calc_06","ps_calc_07","ps_calc_08","ps_calc_09","ps_calc_10","ps_calc_11","ps_calc_12","ps_calc_13","ps_calc_14","ps_calc_15_bin","ps_calc_16_bin","ps_calc_17_bin","ps_calc_18_bin","ps_calc_19_bin","ps_calc_20_bin")
eng_cols <- c("nano", "bin_ind", "bin_calc", "diff_ind", "diff_calc")
train_cols <- c(ind_cols, reg_cols, car_cols, eng_cols)
# target feature
y_col <- c("target")
# identification feature
id_col <- c("id")
# auxilliary features
aux_cols <- c("dset")
#---------------------------------
# General extraction
#---------------------------------
# extract test id column
test_id <- combine %>%
filter(dset == "test") %>%
select(!!sym(id_col))
# all relevant columns for train/test
cols <- c(train_cols, y_col, aux_cols)
combine <- combine %>%
select_(.dots = cols)
# split train/test
train <- combine %>%
filter(dset == "train") %>%
select_(.dots = str_c("-",c(aux_cols)))
test <- combine %>%
filter(dset == "test") %>%
select_(.dots = str_c("-",c(aux_cols, y_col)))
#---------------------------------
The model evaluation metric here is the Normalised Gini Function. It has the advantage that only the relative order of the predicted probabilities is measured, not their absolute value. Thereby, your prediction is successful as long as it assigns low probabilities to true “no claims” observations and high probabilities to true “claims”. The “normalised” property means that probabilities are measured in the interval [0,1]. The metric then orders your predictions by probability and the more “claims” (or “no claims”) are among the high (or low) probabilities the better the result.
For a more extensive and intuitive explanation of the Gini metric I recommend this great kernel by kilian. There are numerous kernels already that deal with (efficiently) implementing this metric. We borrow our definition from this great work by Kevin (many thanks!):
xgb_normalizedgini <- function(preds, dtrain){
actual <- getinfo(dtrain, "label")
score <- NormalizedGini(preds,actual)
return(list(metric = "NormalizedGini", value = score))
}
In order to assess how well our model generalises we will perform a cross-validation step and also split our training data into a train vs validation data set. Thereby, the model performance can be evaluated on a data sample that the algorithm has not seen. We split our data into 80/20 fractions:
set.seed(4321)
trainIndex <- createDataPartition(train$target, p = 0.8, list = FALSE, times = 1)
train <- train[trainIndex,]
valid <- train[-trainIndex,]
Normally, we might want to include a cleaning step where we remove wildly unlikely data points (created by typos or other mistakes) that doesn’t teach us anything about our problem at hand. Here, however the data sets appear to be very well behaved and no cleaning is necessary.
We fit our model using XGBoost because it’s a robust and easy to use tool. In essence, it is a decision-tree gradient boosting algorithm with a high degree of popularity here on Kaggle. The parameters used here are not very sophisticated but should give you an idea how the algorithm works. I recommend to spend some time optimising the parameters and also to try out different tools presented in the other kernels.
A few brief words about gradient boosting, as far as I understand it: Boosting is what we call the step-by-step improvement of a weak learner (like a relatively shallow decision tree of max_depth levels) by successively applying it to the results of the previous learning step (for nrounds times in total). Gradient Boosting focusses on minimising the Loss Function (according to our evaluation metric) by training the algorithm on the gradient of this function. The method of Gradient Decent iteratively moves into the direction of the greatest decent (i.e. most negative first derivative) of the loss function. The step sizes can vary from iteration to iteration but has a multiplicative shrinkage factor eta in (0,1] associated with it for additional tuning. Smaller values of eta result in a slower decent and require higher nrounds.
In order for XGBoost to properly ingest our data samples we need to re-format them slightly:
#convert to XGB matrix
foo <- train %>% select(-target)
bar <- valid %>% select(-target)
dtrain <- xgb.DMatrix(as.matrix(foo),label = train$target)
dvalid <- xgb.DMatrix(as.matrix(bar),label = valid$target)
dtest <- xgb.DMatrix(as.matrix(test))
Now we define the meta-parameters that govern how XGBoost operates:
xgb_params <- list(colsample_bytree = 0.7, #variables per tree
subsample = 0.7, #data subset per tree
booster = "gbtree",
max_depth = 5, #tree levels
eta = 0.3, #shrinkage
eval_metric = xgb_normalizedgini,
objective = "reg:logistic",
seed = 4321,
nthread = -1
)
watchlist <- list(train=dtrain, valid=dvalid)
We first run a cross-validation (CV) step to measure our model’s performance and to determine how many iteration steps are optimal. To keep this stage short, within the limits of this kernel, I’m only running 50 iterations here (nrounds parameter). Ideally, you would want to use more.
We are using a 5-fold CV, which essentially means that we split our data in 5 parts and train on any 4 of them while evaluating on the remaining one. This is done iteratively, so that every fold becomes the validation fold. This is an efficient way to measure the performance of your fit on data it has not seen.
The parts below here are currently not running in the Kernel because of memory problems. I’m investigating. Sorry for that.
set.seed(1234)
xgb_cv <- xgb.cv(xgb_params,dtrain,early_stopping_rounds = 5, nfold = 5, nrounds=50, maximize = TRUE)
The number of the best iteration is now stored in a variable we can access for training our classifier for the optimum number of rounds. Here and above we also set a random seed to ensure reproducability:
set.seed(4321)
gb_dt <- xgb.train(params = xgb_params,
data = dtrain,
print_every_n = 5,
watchlist = watchlist,
nrounds = xgb_cv$best_iteration)
Be aware that this is not a competitive model. Other (engineered) features can be included, the meta-parameters can be optimised, and a larger amount of training steps can be run. In addition, several kernels have already shown the power of ensembling different individual models into a combined prediction that shows better results.
What this basic model can do is to serve as an example on how to construct your own prediction and give you a place to start from. Make sure to check out other modelling kernels as well to get a broader overview.
After training we will check which features are the most important for our model. This can provide the starting point for an iterative process where we identify, step by step, the significant features for a model. Here we will simply visualise these features:
imp_matrix <- as.tibble(xgb.importance(feature_names = colnames(train %>% select(-target)), model = gb_dt))
imp_matrix %>%
ggplot(aes(reorder(Feature, Gain, FUN = max), Gain, fill = Feature)) +
geom_col() +
coord_flip() +
theme(legend.position = "none") +
labs(x = "Features", y = "Importance")
We find:
As pointed out in several other modelling kernels, the features ps_car_13 and ps_reg_03 are the most important ones for our tentative prediction model. Also note the two ps_ind features on 3rd and 4th rank.
In addition, our engineered features are not doing bad either. diff_ind is ranked the 5th most significant feature, and nano still has some impact, too. diff_calc, bin_calc, and bin_ind are towards the end of the list but (not only) in this competition small improvements can have a large impact on the resulting leaderboard score.
Those results are preliminary, since the model performance can certainly be optimised, but they already show that feature engineering is a promising way of improving your prediction.
Once we are reasonably happy with our CV score we use the corresponding model to make a prediction for the test data, write the submission file, and submit it to the Kaggle leaderboard. This gives a performance score based on an independent data set. For this very reason, it is advisable to aim to keep the number of leaderboard submission low. Otherwise you run the risk to have your model influenced too much by this public leaderboard data which will result in overfitting. It’s better to trust your local CV.
Practically, this pre-ulimate code block will apply the XGBoost model to the testing data and write a submission file according to the competition specification. You will then find the file submit.csv in the “Output” tab of the Kaggle kernel or, if you have run the script locally, in the source directory.
pred <- test_id %>%
mutate(target = predict(gb_dt, dtest))
pred %>% write_csv('submit.csv')
All that’s left to do is to double-check this submission file to make sure that we are not wasting a (potentially) valuable submission with a mal-formatted output file. We are comparing its shape and content with the sample submission file that we read in at the very beginning, and also plot the distribution of the predicted values compared to the training values for an approximate comparison:
identical(dim(sample_submit),dim(pred))
glimpse(sample_submit)
glimpse(pred)
bind_cols(sample_submit, pred) %>% head(5)
Prediction file format successfully verified.
And that’s it! Many thanks for reading this kernel! I hope that it will benefit your own analysis and I wish you the best of success.
Have fun!